Smart reindexing

Gregory Kozlovsky

Web sites change. Some change every minute, some once a year. To keep up with site changes, index databases must be periodically renewed.

The simplest way to renew an index database is to erase the old database and run the spider again to create a new one. However, this approach, while requiring less sophisticated software, has serious limitations.

The first and probably least serious problem with this approach is that the search service has to be interrupted while the new index is built. It can be circumvented, at the cost of additional disk space and increased download volume, by creating two alternating indexes and switching between them.
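
As an illustration, the switching can be as simple as repointing a symlink once the spare index has been rebuilt. The following Python sketch assumes the indexes live in two directories and the search frontend reads whichever one a symlink points to; the names index_a, index_b and current are hypothetical and not part of locust:

    import os

    INDEXES = ("index_a", "index_b")
    CURRENT_LINK = "current"          # symlink the search frontend reads from

    def inactive_index():
        """Return the index directory that is not currently being served."""
        active = os.path.realpath(CURRENT_LINK)
        return INDEXES[0] if active.endswith(INDEXES[1]) else INDEXES[1]

    def switch_to(new_index):
        """Atomically repoint the 'current' symlink to the freshly built index."""
        tmp = CURRENT_LINK + ".tmp"
        os.symlink(new_index, tmp)
        os.replace(tmp, CURRENT_LINK)  # atomic rename on POSIX filesystems

The spider would rebuild the index returned by inactive_index() and then call switch_to() on it, so the search service never sees a half-built database.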

Another problem is index instability. Web sites, as seen from the client side, change not only intentionally but also temporarily, as a result of mistakes and software or hardware problems. It may happen that an indexed site, or a part of it, was unreachable at the scheduled indexing time. This can occur in many different ways: network failure, DNS failure, temporarily broken or misconfigured links returning various HTTP codes, sometimes even incorrectly generated pages missing some of the intended links. The site webmaster fixes the problem shortly thereafter and the site is back online, but the search does not work properly until the next reindexing. Users of the site know nothing about the temporary failure that caused the situation and complain about the low quality of the search service.

The most important problem, one that cannot be solved without reindexing against a permanent database, is keeping reliable file creation and modification dates. Many users want search results sorted by last modification time or restricted to a particular time period. On the modern Internet most pages are generated dynamically, and the HTTP Last-Modified entity-header field is either missing or always contains the current time, which makes it useless. Some pages may contain correct creation and modification times in meta tags, but this can be relied upon only in specific cases.

The only reliable way to solve this problem, at least partially, is to keep a permanent index database. The document creation time can then be set to the time the document was first downloaded, and the document modification time can be updated whenever the version downloaded from the site has a different CRC checksum than the cached copy of the document.
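
As an illustration, a minimal Python sketch of this bookkeeping, assuming a simple per-document record with first_seen, last_modified and crc fields; the record layout and the use of CRC-32 via zlib are illustrative, not locust's actual schema:

    import time
    import zlib

    def update_dates(record, body):
        """Set creation time on first download, bump modification time on change."""
        now = int(time.time())
        crc = zlib.crc32(body)
        if record is None:                  # document seen for the first time
            return {"first_seen": now, "last_modified": now, "crc": crc}
        if crc != record["crc"]:            # body differs from the cached copy
            record["last_modified"] = now
            record["crc"] = crc
        return record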

Reindexing algorithm

While reindexing, the spider tries to take advantage of the fact that the site may not have changed much since the last spidering. The HTTP protocol provides a conditional download request that returns the document body only if the document has changed since the last download; if the document has not changed, only an appropriate HTTP header is returned. This feature can potentially save a lot of bandwidth and processing cost. Unfortunately, many if not most servers on the current Internet are not configured to use it properly. Another way to detect unchanged documents and save processing time (but not download volume) is to compare the CRC checksum of the downloaded document with the CRC checksum of its cached copy.
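
A minimal Python sketch of both checks, using only the standard library; the fetch_if_changed name, its arguments, and the stored last_fetch_time and cached_crc values are assumptions made for illustration, not locust code:

    import urllib.error
    import urllib.request
    import zlib
    from email.utils import formatdate

    def fetch_if_changed(url, last_fetch_time, cached_crc):
        """Return the new body if the document changed, otherwise None."""
        req = urllib.request.Request(url)
        # Ask the server to skip the body if nothing changed since the last download.
        req.add_header("If-Modified-Since", formatdate(last_fetch_time, usegmt=True))
        try:
            with urllib.request.urlopen(req) as resp:
                body = resp.read()
        except urllib.error.HTTPError as err:
            if err.code == 304:             # 304 Not Modified: body unchanged
                return None
            raise
        # Many servers ignore If-Modified-Since, so fall back to a checksum compare.
        if zlib.crc32(body) == cached_crc:
            return None
        return body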

During reindexing the spider mimics the original indexing. The starting paths specified in the sets.cnf configuration file are loaded into the indexing queue and links are followed recursively. If a document is unchanged, its last indexing and reindexing times are updated and its link handles are read from the database and loaded into the indexing queue.

Documents are considered unchanged if the server returns a "not modified" response to the conditional download request, or if the CRC checksum of the downloaded document matches the CRC checksum of the cached copy.

Unchanged documents stay unchanged in the database and their keywords and links are not written into the spider journal. Changed documents are processed the same way as during the original indexing.
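
The overall loop might look roughly like the following Python sketch. The helpers (load_start_paths, fetch, fetch_if_changed, the db object and index_document) are hypothetical stand-ins for locust's internals, not its actual API:

    from collections import deque

    def reindex(db, now):
        queue = deque(load_start_paths("sets.cnf"))   # starting paths from sets.cnf
        seen = set()
        while queue:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            record = db.lookup(url)
            if record is not None:
                body = fetch_if_changed(url, record["last_fetch"], record["crc"])
                if body is None:
                    # Unchanged: update indexing times and requeue the stored
                    # link handles; write nothing to the spider journal.
                    db.touch(url, indexed_at=now)
                    queue.extend(db.link_handles(url))
                    continue
            else:
                body = fetch(url)                     # new document, unconditional download
            # Changed or new: process as during the original indexing.
            links = index_document(url, body, journal=db.journal)
            queue.extend(links)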

Search remains available during reindexing, but there will be minor inconsistencies. The reverse index does not change during the spider run; it is updated only when the spider journal is merged with the reverse index. Until the merge is completed, a document can therefore be found only via old keywords that may no longer be present in the document on the site or in its cached copy. Documents that have been removed from the site, or are no longer reachable via links, will still appear in search results, even after the spider journal is merged with the reverse index, until they are deleted with a special tool.

Orphans and deleted documents

During reindexing, some documents stored in the database during previous spider runs will never be reached via links and therefore will not be updated. We will call these documents orphans. Some orphans may still be available on the site. They can be found using the fact that their last indexing time is earlier than the starting time of the last spider run. A document may become an orphan because of a temporary problem on the site during reindexing and stop being an orphan during the next reindexing. Documents that remain orphans for any given number of spider runs can be found the same way, by their last indexing time.

locust provides a tool to delete orphans whose last indexing time is earlier than a specified period before the current time.
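
In database terms the cleanup amounts to something like the following Python sketch; the documents table, the last_indexed column and the use of SQLite are illustrative assumptions, not locust's actual schema or storage engine:

    import sqlite3
    import time

    def delete_orphans(conn, max_age_days):
        """Delete documents whose last indexing time is older than max_age_days."""
        cutoff = int(time.time()) - max_age_days * 86400
        with conn:  # commits the transaction on success
            conn.execute("DELETE FROM documents WHERE last_indexed < ?", (cutoff,))

    # e.g. delete documents that have stayed orphans for more than two weeks:
    # delete_orphans(sqlite3.connect("index.db"), max_age_days=14)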

With our reindexing algorithm, documents that are no longer in a document set described in the sets.cnf configuration file because of a configuration change will also become orphans after the next reindexing, even if they are still on the site. These documents can be identified and safely removed in the same way as other orphans.

Practical considerations

In an ideal world without network, software, and hardware problems and without human mistakes, orphans could be removed immediately after reindexing. Because of the index stability considerations described in the introduction, immediate orphan removal may not be desirable, and every search administrator has to decide when to perform it.

That decision cannot be made without considering the specific situation. If the search administrator knows about large changes made to the site, he or she can decide to clean orphans immediately after reindexing. When indexing a site with frequent timeouts, it may be desirable to clean orphans only after they have remained orphans through several reindexing attempts.